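Workflow:
1. Define the problem.
2. Acquire the training and test sets.
3. Clean and prepare the data.
4. Explore the data, analyzing it to identify patterns.
5. Build a model, make predictions, and solve the problem.
6. Visualize the problem-solving steps and the final solution.
7. Submit the results.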

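Goals: classifying (categorize the samples and consider how each class relates to the target); correlating (measure how each feature relates to the target); converting (turn feature values into types the model accepts); completing (fill in missing values); correcting (fix erroneous feature values); creating (derive new features); charting (choose the right plots).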

Starting the actual analysis


In [1]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

Loading the datasets


In [2]:
train_df = pd.read_csv(r'E:\song_ws\data\kaggle\Titanic\train.csv')
test_df = pd.read_csv(r'E:\song_ws\data\kaggle\Titanic\test.csv')
combine = [train_df, test_df]

Features in the dataset


In [3]:
print(train_df.columns.values)


['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']

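PassengerId: passenger ID number; carries no predictive meaning.
Survived: whether the passenger survived; 0 = No, 1 = Yes.
Pclass: ticket class; 1 = Upper, 2 = Middle, 3 = Lower.
Sex: the passenger's sex.
Age: age in years. Ages below 1 are fractional, and estimated ages may take the form xx.5.
SibSp: number of siblings or spouses aboard. "Sibling" covers brothers, sisters, and siblings in the legal sense.
Parch: number of parents or children aboard.
Ticket: ticket number.
Fare: passenger fare.
Cabin: cabin number.
Embarked: port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton.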

Categorical features: Survived, Sex, Embarked. Ordinal: Pclass. Continuous features: Age, Fare. Discrete: SibSp, Parch.
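The grouping above can be recorded next to a dtype-based first pass. The sketch below is illustrative only: the three-row `sample` frame and the `feature_types` dictionary are assumptions made up for this example, not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row sample using the same column names as train.csv;
# the frame exists only to make the example self-contained.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# Manual grouping taken from the text above. dtypes alone cannot tell a
# categorical integer (Survived) from a discrete count (SibSp).
feature_types = {
    "categorical": ["Survived", "Sex", "Embarked"],
    "ordinal": ["Pclass"],
    "continuous": ["Age", "Fare"],
    "discrete": ["SibSp", "Parch"],
}

# dtype-based split: numeric columns versus everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```

The dtype split is only a starting point; the statistical type of each column still has to be assigned by hand, as in `feature_types`.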


In [4]:
# preview the data
train_df.head()


Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [5]:
train_df.tail()


Out[5]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

In [6]:
train_df.info()
print('_'*40)
test_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

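Ticket mixes digits and letters; Cabin values are a letter followed by digits. Name may contain spelling errors. In the training set, Cabin, Age, and Embarked have missing values (in decreasing order of missing count). In the test set, Cabin, Age, and Fare have missing values (Fare is missing for one row).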
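The ranking of missing values can also be read off directly instead of from the `info()` output. This is a minimal sketch; `missing_summary` is a hypothetical helper name and the `toy` frame is made-up data, so in the notebook it would be called on `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing counts and ratios per column, largest first,
    leaving out columns that have no missing values."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "ratio": counts / len(df)})
    summary = summary[summary["missing"] > 0]
    return summary.sort_values(by="missing", ascending=False)

# Made-up data for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

result = missing_summary(toy)
print(result)
```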

Distribution of the data


In [7]:
train_df.describe()


Out[7]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

In [8]:
train_df.describe(include=['O'])


Out[8]:
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Klasen, Mr. Klas Albin male CA. 2343 G6 S
freq 1 577 7 4 644

Analyzing the features


In [9]:
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)


Out[9]:
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363

In [10]:
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)


Out[10]:
Sex Survived
0 female 0.742038
1 male 0.188908

In [11]:
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)


Out[11]:
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000

In [12]:
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)


Out[12]:
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000

In [13]:
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)


Out[13]:
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.336957

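1. Pclass: survival differs clearly between classes; the higher the class, the higher the survival rate.
2. Sex: women survive at a far higher rate than men.
3. SibSp and Parch: neither shows a clear correlation with survival.
4. Embarked: survival varies somewhat by port of embarkation, with port C noticeably higher.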
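The five groupby cells above repeat one pattern, which could be wrapped in a small helper. `survival_rate_by` is a hypothetical name introduced here, and the `toy` frame is made-up data standing in for `train_df`.

```python
import pandas as pd

def survival_rate_by(df, feature, target="Survived"):
    """Mean of `target` per level of `feature`, highest rate first."""
    return (
        df[[feature, target]]
        .groupby(feature, as_index=False)
        .mean()
        .sort_values(by=target, ascending=False)
    )

# Made-up data for illustration only
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

for feature in ["Sex", "Pclass"]:
    print(survival_rate_by(toy, feature))
```

With the real data, the loop would run over `['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']`.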


In [14]:
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)


Out[14]:
<seaborn.axisgrid.FacetGrid at 0xc28e908>

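1. Infants have a relatively high survival rate.
2. The oldest passenger (age 80) survived.
3. Most passengers are between 15 and 35.
4. The largest number of deaths falls in the 15-25 range.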
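A histogram shows counts, so a rate per age range needs a separate calculation. One way to check the observations numerically is to bin Age with `pd.cut` and aggregate per band. The bin edges and the `toy` frame below are assumptions chosen for illustration, not values from the notebook.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Age": [0.5, 4.0, 18.0, 22.0, 24.0, 30.0, 45.0, 80.0],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 1],
})

# Assumed bin edges that roughly mirror the ranges discussed above;
# pd.cut builds right-closed intervals such as (15, 25] by default.
bins = [0, 5, 15, 25, 35, 80]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins)

# observed=False keeps empty bands in the result
band_stats = (
    toy.groupby("AgeBand", observed=False)["Survived"]
    .agg(["count", "sum", "mean"])
    .rename(columns={"count": "passengers",
                     "sum": "survivors",
                     "mean": "survival_rate"})
)
print(band_stats)
```

Rows with a missing Age receive no band and drop out of the aggregation, which matters for the real training set because 177 ages are missing.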


In [15]:
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()


Out[15]:
<seaborn.axisgrid.FacetGrid at 0xc806f28>

1. Most passengers are in Pclass=3, yet that class has the highest death rate.
2. All infants in Pclass=2 survived.
3. Most passengers in Pclass=1 survived.
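These readings of the grid can be backed by a contingency table. The sketch below uses `pd.crosstab` on a made-up `toy` frame; in the notebook the same two calls would take `train_df['Pclass']` and `train_df['Survived']`.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Raw passenger counts per class and outcome
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# Row-normalized shares: each class sums to 1, so the column for
# Survived == 1 is the survival rate within that class
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(rates)
```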


In [16]:
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()


Out[16]:
<seaborn.axisgrid.FacetGrid at 0xcdf6e48>

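1. Women generally survive at a higher rate than men.
2. In the Embarked=C panel, women appear to survive at a lower rate than men. This may reflect how the plot assigns the Sex levels rather than a real effect, and it does not show a direct relationship between Embarked and Survived.
3. Survival rates differ by port of embarkation.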
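Because seaborn warns that passing `pointplot` to `FacetGrid.map` without an explicit `order` can produce an incorrect plot, a reading such as the one for Embarked=C is worth checking against the numbers. The sketch below shows one way to do that with a grouped mean. The `toy` frame is made-up data, so its values say nothing about the real dataset.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "C", "S", "S", "S", "S"],
    "Sex": ["female", "female", "male", "male",
            "female", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 1, 0, 0, 0],
})

# Survival rate per port and sex, with one column per sex
rates = (
    toy.groupby(["Embarked", "Sex"])["Survived"]
    .mean()
    .unstack("Sex")
)
print(rates)
```

Running the same groupby on `train_df` gives the per-port rates for each sex directly, independent of how the plot assigns colors.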


In [17]:
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', height=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()


Out[17]:
<seaborn.axisgrid.FacetGrid at 0xcfb6828>
